Decision Tree#
Note
A decision tree splits a node on one feature at each step.
ID3 splits according to information gain.
C4.5 splits according to information gain ratio.
CART splits according to squared error (regression) or the Gini index (classification).
Pruning tries to minimize the leaves' entropy and the model's complexity at the same time.
Information Gain ID3#
Suppose we have a classification problem with dataset \(D=\{(x^{(1)}, y^{(1)}),...,(x^{(n)}, y^{(n)})\}\), \(y^{(i)} \in \left\{1,...,k\right\}\).
Then the entropy of \(D\), which measures its uncertainty, is

\[H(D)=-\sum_{i=1}^{k}p_{i}\log p_{i}\]

where \(p_{i}\) is the proportion of samples in \(D\) with label \(i\).

Assume we partition \(D\) according to feature \(A\) into \(D_{1},...,D_{m}\); then the entropy of \(D\) after knowing \(A\) is

\[H(D\mid A)=\sum_{j=1}^{m}\frac{|D_{j}|}{|D|}H(D_{j})\]

Information gain is the loss of uncertainty:

\[g(D,A)=H(D)-H(D\mid A)\]
The ID3 decision tree chooses the feature \(A\) that maximizes \(g(D,A)\) at each node, and stops splitting when:
- the subset is empty
- the information gain \(g(D,A)\le\epsilon\)
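A minimal computational sketch of these quantities, assuming NumPy; the toy data and the helper names entropy and information_gain are illustrative, not part of any library.

import numpy as np

def entropy(y):
    """H(D): entropy of the label array y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, feature):
    """g(D, A) = H(D) - H(D|A), where feature holds each sample's value of A."""
    h_cond = sum(                       # H(D|A): weighted entropy of each subset D_j
        np.mean(feature == v) * entropy(y[feature == v])
        for v in np.unique(feature)
    )
    return entropy(y) - h_cond

# toy example: the feature perfectly separates the two labels
y = np.array([0, 0, 1, 1])
A = np.array(["a", "a", "b", "b"])
print(information_gain(y, A))  # 1.0 -- one full bit of uncertainty removed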
Information Gain Ratio C4.5#
Information gain prefers features \(A\) with many distinct values (\(\#A\) large); more precisely, it prefers features for which

\[H_{A}(D)=-\sum_{j=1}^{m}\frac{|D_{j}|}{|D|}\log\frac{|D_{j}|}{|D|}\]

is large.
The information gain ratio fixes this by dividing by \(H_{A}(D)\):

\[g_{R}(D,A)=\frac{g(D,A)}{H_{A}(D)}\]
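A sketch of the gain ratio under the same conventions: \(H_{A}(D)\) is just the entropy of the feature's own value distribution. The helper name gain_ratio is illustrative.

import numpy as np

def _entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(y, feature):
    """g_R(D, A) = g(D, A) / H_A(D)."""
    h_cond = sum(                       # H(D|A)
        np.mean(feature == v) * _entropy(y[feature == v])
        for v in np.unique(feature)
    )
    gain = _entropy(y) - h_cond         # g(D, A)
    split_info = _entropy(feature)      # H_A(D): entropy of A's own values
    return gain / split_info

# an id-like feature (one distinct value per sample) still has gain 1.0,
# but its large H_A(D) shrinks the ratio
y = np.array([0, 0, 1, 1])
ids = np.array(["s1", "s2", "s3", "s4"])
print(gain_ratio(y, ids))  # 0.5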
CART: Classification and Regression Tree#
For a regression problem, we try to find the feature \(j\) and cutting point \(s\) that minimize the squared error:

\[\min_{j,s}\left[\min_{c_{1}}\sum_{x^{(i)}\in R_{1}(j,s)}(y^{(i)}-c_{1})^{2}+\min_{c_{2}}\sum_{x^{(i)}\in R_{2}(j,s)}(y^{(i)}-c_{2})^{2}\right]\]

where \(R_{1}(j,s)=\{x\mid x_{j}\le s\}\) and \(R_{2}(j,s)=\{x\mid x_{j}>s\}\) are the two regions, and the optimal \(c_{1},c_{2}\) are the region means.
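A brute-force sketch of this split search with NumPy: since the optimal \(c_{1},c_{2}\) are the region means, each candidate \((j,s)\) is scored by the resulting squared error. The helper name best_split is illustrative.

import numpy as np

def best_split(X, y):
    """Return (feature j, cutting point s, squared error) of the best split."""
    best_j, best_s, best_err = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # optimal c1, c2 are the means of each region
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best_err:
                best_j, best_s, best_err = j, s, err
    return best_j, best_s, best_err

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
print(best_split(X, y))  # splits feature 0 at s = 3.0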
For a classification problem, CART uses the Gini index:

\[\text{Gini}(D)=\sum_{i=1}^{k}p_{i}(1-p_{i})=1-\sum_{i=1}^{k}p_{i}^{2}\]

where \(p_{i}\) is the proportion of samples in \(D\) with label \(i\).
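A short sketch of the Gini index with NumPy; gini is an illustrative helper name.

import numpy as np

def gini(y):
    """Gini(D) = 1 - sum_i p_i^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array([0, 0, 1, 1])))  # 0.5 -- maximally impure two-class node
print(gini(np.array([0, 0, 0, 0])))  # 0.0 -- pure node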
Pruning#
Suppose the tree \(T\) has \(|T|\) leaf nodes, and leaf \(t\) contains \(N_{t}\) samples with entropy \(H_{t}(T)\). The total entropy of these leaves is

\[C(T)=\sum_{t=1}^{|T|}N_{t}H_{t}(T)\]

We want to minimize this entropy while at the same time limiting the model's complexity.
Loss function:

\[C_{\alpha}(T)=C(T)+\alpha|T|\]

If pruning a node results in a decrease of the loss function, then prune that node.
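A minimal sketch of this pruning rule: compare \(C_{\alpha}\) before and after collapsing a node's leaves into a single leaf. The data layout (one label array per leaf) and the helper names are illustrative assumptions.

import numpy as np

def leaf_entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def should_prune(leaf_labels, alpha):
    """Prune if collapsing the leaves does not increase C_alpha(T) = C(T) + alpha * |T|."""
    cost_keep = sum(len(y) * leaf_entropy(y) for y in leaf_labels) + alpha * len(leaf_labels)
    merged = np.concatenate(leaf_labels)
    cost_prune = len(merged) * leaf_entropy(merged) + alpha * 1
    return cost_prune <= cost_keep

# two nearly pure leaves: prune only when the complexity penalty alpha dominates
leaves = [np.array([0, 0, 0, 1]), np.array([1, 1, 1, 0])]
print(should_prune(leaves, alpha=0.1))  # False -- keep the split
print(should_prune(leaves, alpha=5.0))  # True  -- prune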
Examples#
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
"""hyperparameters"""
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(criterion="entropy",  # default is "gini"
                                  # maximum depth of the tree
                                  max_depth=3,
                                  # maximum number of leaf nodes
                                  max_leaf_nodes=15,
                                  # max number of features considered when splitting each node
                                  max_features=2,
                                  # min number of samples a node must have before it can be split
                                  min_samples_split=8,
                                  # min number of samples of a leaf node
                                  min_samples_leaf=3,
                                  # same as min_samples_leaf, but as a fraction of total sample weight
                                  min_weight_fraction_leaf=0.01)
tree_clf.fit(X, y)
DecisionTreeClassifier(criterion='entropy', max_depth=3, max_features=2,
                       max_leaf_nodes=15, min_samples_leaf=3,
                       min_samples_split=8, min_weight_fraction_leaf=0.01)
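As a quick usage check (a sketch that assumes the fitted tree_clf above), the tree can predict the class and class probabilities of a new flower from its petal length and width:

import numpy as np

sample = np.array([[5.0, 1.5]])        # hypothetical petal length 5.0 cm, width 1.5 cm
print(tree_clf.predict(sample))        # predicted class index
print(tree_clf.predict_proba(sample))  # class probabilities from the leaf the sample falls into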
"""visualize using graphviz, need 1.pip install graphviz, 2.brew install graphviz"""
from graphviz import Source
from sklearn.tree import export_graphviz
export_graphviz(tree_clf,
                out_file="iris_tree.dot",
                feature_names=iris.feature_names[2:],
                class_names=iris.target_names,
                rounded=True,
                filled=True
               )
Source.from_file("iris_tree.dot")
Grid Search#
from sklearn.model_selection import GridSearchCV
param_grid = [{"max_leaf_nodes": [2, 5], "min_samples_split": [3, 4]}]
clf = DecisionTreeClassifier()
grid_search = GridSearchCV(clf, param_grid, cv=3, verbose=3)
grid_search.fit(X, y)
grid_search.best_estimator_
Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] max_leaf_nodes=2, min_samples_split=3 ...........................
[CV]  max_leaf_nodes=2, min_samples_split=3, score=0.660, total=   0.0s
[CV] max_leaf_nodes=2, min_samples_split=3 ...........................
[CV]  max_leaf_nodes=2, min_samples_split=3, score=0.660, total=   0.0s
[CV] max_leaf_nodes=2, min_samples_split=3 ...........................
[CV]  max_leaf_nodes=2, min_samples_split=3, score=0.660, total=   0.0s
[CV] max_leaf_nodes=2, min_samples_split=4 ...........................
[CV]  max_leaf_nodes=2, min_samples_split=4, score=0.660, total=   0.0s
[CV] max_leaf_nodes=2, min_samples_split=4 ...........................
[CV]  max_leaf_nodes=2, min_samples_split=4, score=0.660, total=   0.0s
[CV] max_leaf_nodes=2, min_samples_split=4 ...........................
[CV]  max_leaf_nodes=2, min_samples_split=4, score=0.660, total=   0.0s
[CV] max_leaf_nodes=5, min_samples_split=3 ...........................
[CV]  max_leaf_nodes=5, min_samples_split=3, score=0.980, total=   0.0s
[CV] max_leaf_nodes=5, min_samples_split=3 ...........................
[CV]  max_leaf_nodes=5, min_samples_split=3, score=0.920, total=   0.0s
[CV] max_leaf_nodes=5, min_samples_split=3 ...........................
[CV]  max_leaf_nodes=5, min_samples_split=3, score=0.960, total=   0.0s
[CV] max_leaf_nodes=5, min_samples_split=4 ...........................
[CV]  max_leaf_nodes=5, min_samples_split=4, score=0.980, total=   0.0s
[CV] max_leaf_nodes=5, min_samples_split=4 ...........................
[CV]  max_leaf_nodes=5, min_samples_split=4, score=0.940, total=   0.0s
[CV] max_leaf_nodes=5, min_samples_split=4 ...........................
[CV]  max_leaf_nodes=5, min_samples_split=4, score=1.000, total=   0.0s
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:    0.0s finished
DecisionTreeClassifier(max_leaf_nodes=5, min_samples_split=4)
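To see which hyperparameters won and their mean cross-validation score, the fitted grid search can be inspected (a usage sketch, assuming the cells above have been run):

print(grid_search.best_params_)  # e.g. {'max_leaf_nodes': 5, 'min_samples_split': 4}
print(grid_search.best_score_)   # mean CV accuracy of the best candidate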
